Systems and methods for detecting anomalies

ABSTRACT

Apparatus and method for detecting anomalies in a computer system are disclosed herein. In some embodiments, multiple probes are executed on an evolving data set. Each probe may return a result. Property values are then derived from a respective result returned by a corresponding probe. Surprise scores corresponding to the property values are generated, where each surprise score is generated based on a comparison between a corresponding property value and historical property values. The corresponding property value and the historical property values are derived from results returned from the same probe. Historical surprise scores generated by the anomaly detection engine are accessed. Responsive to a comparison between the plurality of surprise scores and the plurality of historical surprise scores, a monitoring system is alerted of an anomaly regarding the evolving data set.

PRIORITY INFORMATION

This application is a continuation of and claims the benefit of priority to U.S. patent application Ser. No. 14/143,185, filed on Dec. 30, 2013, which is hereby incorporated by reference herein in its entirety.

TECHNICAL FIELD

The present invention relates generally to data processing, and in some embodiments, to detecting anomalies in computer-based systems.

BACKGROUND

In many cases, enterprises maintain and operate large numbers of computer systems (e.g., servers) that may each run a layered set of software. In some cases, these computer systems provide functionality for the operation of the enterprise or to provide outbound services to their customers. In many cases, the enterprise may monitor the hardware and software layers of these servers by logging processing load, memory usage, and many other monitored signals at frequent intervals.

Unfortunately, the enterprise may occasionally suffer disruptions, where some of its services are degraded or even completely unavailable to customers. To resolve these disruptions, the enterprise will perform a post-mortem analysis of the monitored signals in an effort to debug the system. For example, the enterprise may analyze the memory usage to identify a program that may be performing improperly, or view the processing load to determine whether more hardware is needed.

Thus, traditional systems may utilize methods and systems for addressing anomalies that involve debugging a computer system after the anomaly has affected the computer system and, by extension, the users.

BRIEF DESCRIPTION OF THE DRAWINGS

Some embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which:

FIG. 1 illustrates a block diagram depicting a network architecture of a system, according to some embodiments, having a client-server architecture configured for exchanging data over a network.

FIG. 2 illustrates a block diagram showing components provided within the system of FIG. 1 according to some embodiments.

FIG. 3 is a diagram showing sampled values of a number of searches performed on a computer system that are observed over a time period, such as a twenty-four hour period, according to an example embodiment.

FIG. 4 is a diagram showing additional sampled values from two additional days, as compared to FIG. 3, according to an example embodiment.

FIG. 5 is a diagram of a plot of metric data over time for a metric of a computer system, according to an example embodiment.

FIG. 6 is a histogram charting surprise scores from a number of queries submitted to a computer system, according to an example embodiment.

FIG. 7 is another histogram showing the surprise scores according to a logarithmic function, according to an example embodiment.

FIG. 8 is a histogram showing quantiles for a metric over a two-week period, according to an example embodiment.

FIG. 9 is a plot of the quantiles in a time series, according to an example embodiment.

FIG. 10 is a histogram that includes the quantiles shown in FIG. 8 but with a new quantile, according to an example embodiment.

FIG. 11 is a plot of the quantiles in a time series with a new quantile, according to an example embodiment.

FIG. 12 is a flowchart diagram illustrating a method for detecting an anomaly in a computer system, according to an example embodiment.

FIG. 13 is a diagram illustrating a property value table that may be generated based on executing the probes, according to an example embodiment.

FIG. 14 is a diagram showing property values for a probe-property type pair, according to an example embodiment.

FIG. 15 is a diagram illustrating a surprise score table, according to an example embodiment.

FIG. 16 is a chart showing a measurement of a feature of the surprise scores generated by operation, according to an example embodiment.

FIG. 17 is a chart illustrating surprise score features over time, according to an example embodiment.

FIG. 18 illustrates a diagrammatic representation of a machine in the example form of a computer system within which a set of instructions may be executed to cause the machine to perform any one or more of the methodologies discussed herein.

The headings provided herein are for convenience only and do not necessarily affect the scope or meaning of the terms used.

DETAILED DESCRIPTION

Described in detail herein is an apparatus and method for detecting anomalies in a computer system. For example, some embodiments may be used to address the problem of how to monitor signals in a computer system to detect disruptions before they affect users, and to do so with few false positives. Some embodiments may address this problem by analyzing signals for strange behavior that may be referred to as an anomaly. Example embodiments can then scan multiple monitored signals, and raise an alert when the site monitoring system detects an anomaly.

Various modifications to the example embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and applications without departing from the scope of the invention. Moreover, in the following description, numerous details are set forth for the purpose of explanation. However, one of ordinary skill in the art will realize that the invention may be practiced without the use of these specific details. In other instances, well-known structures and processes are not shown in block diagram form in order not to obscure the description of the invention with unnecessary detail. Thus, the present disclosure is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

For example, in some embodiments, multiple probes (e.g., queries) are executed on an evolving data set (e.g., a listing database). Each probe may return a result. Property values are then derived from a respective result returned by one of the probes. A property value may be a value that quantifies a property or aspect of the result, such as, for example, a number of listings returned, a portion of classified listings, a measurement of the prices in a listing, and the like.

Surprise scores corresponding to the property values are generated, where each surprise score is generated based on a comparison between a corresponding property value and historical property values. The corresponding property value and the historical property values are derived from results returned from the same probe. Historical surprise scores generated by the anomaly detection engine are accessed. Responsive to a comparison between the plurality of surprise scores and the plurality of historical surprise scores, a monitoring system is alerted of an anomaly regarding the evolving data set.

FIG. 1 illustrates a network diagram depicting a network system 100, according to one embodiment, having a client-server architecture configured for exchanging data over a network. A networked system 102 forms a network-based publication system that provides server-side functionality, via a network 104 (e.g., the Internet or Wide Area Network (WAN)), to one or more clients and devices. FIG. 1 further illustrates, for example, one or both of a web client 106 (e.g., a web browser) and a programmatic client 108 executing on device machines 110 and 112. In one embodiment, the publication system 100 comprises a marketplace system. In another embodiment, the publication system 100 comprises other types of systems such as, but not limited to, a social networking system, a matching system, a recommendation system, an electronic commerce (e-commerce) system, a search system, and the like.

Each of the device machines 110, 112 comprises a computing device that includes at least a display and communication capabilities with the network 104 to access the networked system 102. The device machines 110, 112 comprise, but are not limited to, remote devices, work stations, computers, general purpose computers, Internet appliances, hand-held devices, wireless devices, portable devices, wearable computers, cellular or mobile phones, portable digital assistants (PDAs), smart phones, tablets, ultrabooks, netbooks, laptops, desktops, multi-processor systems, microprocessor-based or programmable consumer electronics, game consoles, set-top boxes, network PCs, mini-computers, and the like. Each of the device machines 110, 112 may connect with the network 104 via a wired or wireless connection. For example, one or more portions of network 104 may be an ad hoc network, an intranet, an extranet, a virtual private network (VPN), a local area network (LAN), a wireless LAN (WLAN), a wide area network (WAN), a wireless WAN (WWAN), a metropolitan area network (MAN), a portion of the Internet, a portion of the Public Switched Telephone Network (PSTN), a cellular telephone network, a wireless network, a WiFi network, a WiMax network, another type of network, or a combination of two or more such networks.

Each of the device machines 110, 112 includes one or more applications (also referred to as “apps”) such as, but not limited to, a web browser, messaging application, electronic mail (email) application, an e-commerce site application (also referred to as a marketplace application), and the like. In some embodiments, if the e-commerce site application is included in a given one of the device machines 110, 112, then this application is configured to locally provide the user interface and at least some of the functionalities, with the application configured to communicate with the networked system 102, on an as-needed basis, for data and/or processing capabilities not locally available (such as access to a database of items available for sale, to authenticate a user, to verify a method of payment, etc.). Conversely, if the e-commerce site application is not included in a given one of the device machines 110, 112, the given one of the device machines 110, 112 may use its web browser to access the e-commerce site (or a variant thereof) hosted on the networked system 102. Although two device machines 110, 112 are shown in FIG. 1, more or fewer than two device machines can be included in the system 100.

An Application Program Interface (API) server 114 and a web server 116 are coupled to, and provide programmatic and web interfaces respectively to, one or more application servers 118. The application servers 118 host one or more marketplace applications 120 and payment applications 122. The application servers 118 are, in turn, shown to be coupled to one or more database servers 124 that facilitate access to one or more databases 126.

The marketplace applications 120 may provide a number of e-commerce functions and services to users that access networked system 102. E-commerce functions/services may include a number of publisher functions and services (e.g., search, listing, content viewing, payment, etc.). For example, the marketplace applications 120 may provide a number of services and functions to users for listing goods and/or services or offers for goods and/or services for sale, searching for goods and services, facilitating transactions, and reviewing and providing feedback about transactions and associated users. Additionally, the marketplace applications 120 may track and store data and metadata relating to listings, transactions, and user interactions. In some embodiments, the marketplace applications 120 may publish or otherwise provide access to content items stored in application servers 118 or databases 126 accessible to the application servers 118 and/or the database servers 124. The payment applications 122 may likewise provide a number of payment services and functions to users. The payment applications 122 may allow users to accumulate value (e.g., in a commercial currency, such as the U.S. dollar, or a proprietary currency, such as “points”) in accounts, and then later to redeem the accumulated value for products or items (e.g., goods or services) that are made available via the marketplace applications 120. While the marketplace and payment applications 120 and 122 are shown in FIG. 1 to both form part of the networked system 102, it will be appreciated that, in alternative embodiments, the payment applications 122 may form part of a payment service that is separate and distinct from the networked system 102. In other embodiments, the payment applications 122 may be omitted from the system 100. In some embodiments, at least a portion of the marketplace applications 120 may be provided on the device machines 110 and/or 112.

Further, while the system 100 shown in FIG. 1 employs a client-server architecture, embodiments of the present disclosure are not limited to such an architecture, and may equally well find application in, for example, a distributed or peer-to-peer architecture system. The various marketplace and payment applications 120 and 122 may also be implemented as standalone software programs, which do not necessarily have networking capabilities.

The web client 106 accesses the various marketplace and payment applications 120 and 122 via the web interface supported by the web server 116. Similarly, the programmatic client 108 accesses the various services and functions provided by the marketplace and payment applications 120 and 122 via the programmatic interface provided by the API server 114. The programmatic client 108 may, for example, be a seller application (e.g., the TurboLister application developed by eBay Inc., of San Jose, Calif.) to enable sellers to author and manage listings on the networked system 102 in an off-line manner, and to perform batch-mode communications between the programmatic client 108 and the networked system 102.

FIG. 1 also illustrates a third party application 128, executing on a third party server machine 130, as having programmatic access to the networked system 102 via the programmatic interface provided by the API server 114. For example, the third party application 128 may, utilizing information retrieved from the networked system 102, support one or more features or functions on a website hosted by the third party. The third party website may, for example, provide one or more promotional, marketplace, or payment functions that are supported by the relevant applications of the networked system 102.

FIG. 2 illustrates a block diagram showing components provided within the networked system 102 according to some embodiments. The networked system 102 may be hosted on dedicated or shared server machines (not shown) that are communicatively coupled to enable communications between server machines. The components themselves are communicatively coupled (e.g., via appropriate interfaces) to each other and to various data sources, so as to allow information to be passed between the applications or so as to allow the applications to share and access common data. Furthermore, the components may access one or more databases 126 via the database servers 124.

The networked system 102 may provide a number of publishing, listing, and/or price-setting mechanisms whereby a seller (also referred to as a first user) may list (or publish information concerning) goods or services for sale or barter, a buyer (also referred to as a second user) can express interest in or indicate a desire to purchase or barter such goods or services, and a transaction (such as a trade) may be completed pertaining to the goods or services. To this end, the networked system 102 may comprise at least one publication engine 202 and one or more selling engines 204. The publication engine 202 may publish information, such as item listings or product description pages, on the networked system 102. In some embodiments, the selling engines 204 may comprise one or more fixed-price engines that support fixed-price listing and price setting mechanisms and one or more auction engines that support auction-format listing and price setting mechanisms (e.g., English, Dutch, Chinese, Double, Reverse auctions, etc.). The various auction engines may also provide a number of features in support of these auction-format listings, such as a reserve price feature whereby a seller may specify a reserve price in connection with a listing and a proxy-bidding feature whereby a bidder may invoke automated proxy bidding. The selling engines 204 may further comprise one or more deal engines that support merchant-generated offers for products and services.

A listing engine 206 allows sellers to conveniently author listings of items or authors to author publications. In one embodiment, the listings pertain to goods or services that a user (e.g., a seller) wishes to transact via the networked system 102. In some embodiments, the listings may be an offer, deal, coupon, or discount for the good or service. Each good or service is associated with a particular category. The listing engine 206 may receive listing data such as title, description, and aspect name/value pairs. Furthermore, each listing for a good or service may be assigned an item identifier. In other embodiments, a user may create a listing that is an advertisement or other form of information publication. The listing information may then be stored to one or more storage devices coupled to the networked system 102 (e.g., databases 126). Listings also may comprise product description pages that display a product and information (e.g., product title, specifications, and reviews) associated with the product. In some embodiments, the product description page may include an aggregation of item listings that correspond to the product described on the product description page.

The listing engine 206 also may allow buyers to conveniently author listings or requests for items desired to be purchased. In some embodiments, the listings may pertain to goods or services that a user (e.g., a buyer) wishes to transact via the networked system 102. Each good or service is associated with a particular category. The listing engine 206 may receive as much or as little listing data, such as title, description, and aspect name/value pairs, that the buyer is aware of about the requested item. In some embodiments, the listing engine 206 may parse the buyer's submitted item information and may complete incomplete portions of the listing. For example, if the buyer provides a brief description of a requested item, the listing engine 206 may parse the description, extract key terms and use those terms to make a determination of the identity of the item. Using the determined item identity, the listing engine 206 may retrieve additional item details for inclusion in the buyer item request. In some embodiments, the listing engine 206 may assign an item identifier to each listing for a good or service.

In some embodiments, the listing engine 206 allows sellers to generate offers for discounts on products or services. The listing engine 206 may receive listing data, such as the product or service being offered, a price and/or discount for the product or service, a time period for which the offer is valid, and so forth. In some embodiments, the listing engine 206 permits sellers to generate offers from the sellers' mobile devices. The generated offers may be uploaded to the networked system 102 for storage and tracking.

In a further example embodiment, the listing engine 206 allows users to navigate through various categories, catalogs, or inventory data structures according to which listings may be classified within the networked system 102. For example, the listing engine 206 allows a user to successively navigate down a category tree comprising a hierarchy of categories (e.g., the category tree structure) until a particular set of listings is reached. Various other navigation applications within the listing engine 206 may be provided to supplement the searching and browsing applications. The listing engine 206 may record the various user actions (e.g., clicks) performed by the user in order to navigate down the category tree.

Searching the networked system 102 is facilitated by a searching engine 208. For example, the searching engine 208 enables keyword queries of listings published via the networked system 102. In example embodiments, the searching engine 208 receives the keyword queries from a device of a user and conducts a review of the storage device storing the listing information. The review will enable compilation of a result set of listings that may be sorted and returned to the client device (e.g., device machine 110, 112) of the user. The searching engine 208 may record the query (e.g., keywords) and any subsequent user actions and behaviors (e.g., navigations, selections, or click-throughs).

The searching engine 208 also may perform a search based on a location of the user. A user may access the searching engine 208 via a mobile device and generate a search query. Using the search query and the user's location, the searching engine 208 may return relevant search results for products, services, offers, auctions, and so forth to the user. The searching engine 208 may identify relevant search results both in a list form and graphically on a map. Selection of a graphical indicator on the map may provide additional details regarding the selected search result. In some embodiments, the user may specify, as part of the search query, a radius or distance from the user's current location to limit search results.

The searching engine 208 also may perform a search based on an image. The image may be taken from a camera or imaging component of a client device or may be accessed from storage.

In addition to the above-described modules, the networked system 102 may further include an anomaly detection engine 212 and a probe module 210 to perform various anomaly detection functionalities or operations as set forth in greater detail below.

Anomaly Detection

As explained above, some example embodiments may be configured to detect anomalies in an evolving data set by comparing surprise scores of property values received from a probe module. However, before describing the methods and systems for detecting anomalies in a computer system in greater detail, some simplified examples of analyzing property values are now described to highlight some potential aspects addressed by example embodiments. For example, as a warm-up problem, consider a signal from a high software layer: the number of searches (“srp”) received or performed by the networked system 102 of FIG. 1. In some cases, srp may be tracked by the probe module 210 periodically, say, for example, every two minutes. FIG. 3 is a diagram showing sampled values 300 of srp observed over a time period, such as a twenty-four hour period, according to an example embodiment. The vertical axis may represent sampled values of the number of searches performed over a two-minute period, whereas the horizontal axis may represent time, which ranges from midnight to midnight, PDT. In analyzing the sampled values 300 of srp, the anomaly detection engine 212 may identify that sampled values 302 and 304, occurring around 4:00 AM and 10:30 PM, respectively, are suspicious because the sampled values 302 and 304 each exhibit a comparatively drastic deviation from their neighboring values. It is to be appreciated that traditional statistical methods may be unable to reliably determine whether sampled values 302 and 304 should be considered anomalies based simply on the data found in the sampled values 300 shown in FIG. 3. That is, based on the sampled values 300, prior art systems are unlikely to reliably determine (e.g., without issuing too many false positives) whether sampled values 302 and 304 are site disruptions or not.

But now consider FIG. 4, which shows additional sampled values for the number of searches per time period. For example, FIG. 4 is a diagram showing sampled values 400 that includes sampled values from the prior two days, relative to the sampled values 300 of FIG. 3, according to an example embodiment. Based on the sampled values 400, one may reasonably conclude that the sampled value 302 should be categorized as an anomaly, but not the sampled value 304. Such is the case because the sampled value 304 is consistent with the other two days of samples, whereas the sampled value 302 is inconsistent with the other two days.

FIGS. 3 and 4 suggest that comparing a property value of a result against historical property values may be used as a simple feature for detecting anomalies in a computer system. The feature may be based on comparing a current value of srp with its respective value 24 hours ago, 48 hours ago, 72 hours ago, and so forth.
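By way of illustration only, the day-over-day comparison described above might be sketched in Python as follows. The function name seasonal_deviation, its parameters, and the choice of a median over the lagged values are assumptions made for this sketch and are not taken from the described embodiments.

```python
from statistics import median


def seasonal_deviation(samples, period, lags=(1, 2, 3)):
    """Unsigned deviation of the newest sample from same-time-of-day history.

    samples: values taken at a fixed interval, oldest first.
    period:  number of samples per 24 hours (e.g., 720 for two-minute sampling).
    lags:    how many days back to look (1 = 24 hours ago, 2 = 48 hours ago, ...).
    """
    current_index = len(samples) - 1
    # Collect the values observed at the same time of day on earlier days.
    history = [
        samples[current_index - k * period]
        for k in lags
        if current_index - k * period >= 0
    ]
    if not history:
        raise ValueError("not enough history for any requested lag")
    # The median is an assumed choice; it keeps one odd day from dominating.
    return abs(samples[-1] - median(history))
```

A large return value indicates that the newest sample is inconsistent with the same time of day on prior days, as with the sampled value 302; a small return value corresponds to a recurring daily pattern, as with the sampled value 304.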

Another example of detecting anomalies in a computer system is now described with reference to FIGS. 5-11. In this example computer system, the probe module 210 may periodically issue a query (or a set of queries) and log values for one or more properties relating to the search result returned by the query (or each query in the set of queries). Examples of properties that may be tracked by the probe module 210 include a number of items returned from a search query, the average list price, a measurement of the number of items that are auctions relative to non-auction items, a number of classified listings, a number of searches executed, etc. The anomaly detection engine 212 may repeatedly cycle through a fixed set (e.g., tens, hundreds, thousands, and so forth) of queries to build a historical model of the property values for each of the search queries over time.
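As a minimal sketch of how property values might be derived from a single probe result, consider the following Python function. The listing fields "price" and "format", the property names, and the function name derive_property_values are hypothetical and are used only for illustration.

```python
from statistics import median


def derive_property_values(listings):
    """Derive example property values from one probe result.

    listings: list of dicts with hypothetical keys "price" (number)
              and "format" ("auction" or another string).
    """
    count = len(listings)
    prices = [item["price"] for item in listings]
    auctions = sum(1 for item in listings if item["format"] == "auction")
    return {
        # Number of items returned by the search query.
        "num_items": count,
        # Median list price; undefined (None) for an empty result.
        "median_price": median(prices) if prices else None,
        # Share of auction items relative to all items returned.
        "auction_fraction": auctions / count if count else None,
    }
```

Logging such a dictionary for each query on each cycle would yield one time series per (query, property) combination.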

FIG. 5 is a diagram of a plot of property values 500 over time for a property of a computer system, according to an example embodiment. The property values may include one or more sampled values for a property, which are sampled over time. The horizontal axis of FIG. 5 represents time, with the right-hand side representing the most recent samples. The vertical axis of FIG. 5 represents values of the property being monitored by the anomaly detection engine 212. By way of example and not limitation, the property values 500 may represent the median sales price for the listings returned when the probe module 210 submits the search query “htc hd2” to the searching engine 208. As shown in FIG. 5, the plot may include a fitted line 504 to represent expected values for the property over time. As may be appreciated from FIG. 5, the property values 500 exhibit some noise (e.g., values that deviate from the fitted line 504). However, even when compared to the noise within the property values 500, the property value 502 may represent an anomaly because the property value 502 deviates significantly from the fitted line 504 when compared to the other values of the property values 500.

In some cases, the anomaly detection engine 212 may determine whether a value of a property represents an anomaly caused by a site disruption based in part on calculating surprise scores for the property value. A surprise score may be a measurement used to quantify how far out of the norm a value for a property is, based on historical values for that property. For example, the anomaly detection engine 212 may quantify the surprise score for a value of a property by computing the (unsigned) deviation of each property value from an expected value. For example, one specific implementation of calculating a surprise score may involve dividing the deviation of a value from the expected value (e.g., the fitted line 504) by the median deviation of all the values. Assuming the deviation for the value 502 is 97.9 and the median deviation for all the values of the property values 500 is 13.4, the anomaly detection engine 212 may assign the value 502 a surprise score of 7.3 (e.g., 97.9/13.4).
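The calculation just described (unsigned deviation from an expected value, divided by the median deviation) can be sketched in Python as follows. The function name surprise_scores and the handling of a zero median deviation are assumptions of this sketch; how the expected values (e.g., the fitted line 504) are obtained is left outside the sketch.

```python
from statistics import median


def surprise_scores(values, expected):
    """Return one surprise score per property value.

    values:   observed property values over time.
    expected: expected value at each time (e.g., points on a fitted line).
    """
    # Unsigned deviation of each value from its expected value.
    deviations = [abs(v - e) for v, e in zip(values, expected)]
    scale = median(deviations)
    if scale == 0:
        # Assumed behavior: the ratio is undefined when the typical deviation is zero.
        raise ValueError("median deviation is zero; surprise score undefined")
    return [d / scale for d in deviations]


# Worked figure from the text: a deviation of 97.9 with a median deviation of 13.4.
example_score = round(97.9 / 13.4, 1)  # 7.3
```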

Some embodiments may address the issue of whether a particular surprise score (e.g., a surprise score of 7.3, as discussed above) should trigger an alert that there may be an anomaly in the computer system. FIG. 6 is a histogram charting surprise scores from a number of queries submitted to a computer system, according to an example embodiment. In the context of FIG. 6, a surprise score of 7 is not unusual because a surprise score of 7 is not far off in value from other surprise scores. In fact, according to FIG. 6, there are many other queries that result in surprise scores that are of higher value than 7.

To clarify surprise scores shown in FIG. 6, FIG. 7 is another histogram showing the surprise scores according to a logarithmic function, according to an example embodiment. Since log(7.3)≈2, it is clear that a value of 2 is not all that unusual. Quantitatively, the percentile of the surprise score for the query “htc hd2” is about 96%. Ringing an alarm for a surprise this large may generate a large number of false positives.

Incidentally, FIGS. 6 and 7 illustrate the difficulty of achieving a low false positive rate when using statistical methods for detecting anomalies, according to an example embodiment. An example system may, for example, execute 3000 queries six times a day to log measurement data relating to 40 different properties. Under these constraints, the anomaly detection engine 212 may generate 3000×40×6=720,000 graphs or tables each day. Achieving a rate as low as one false positive per day would require triggering only on graphs with a surprise score so high that it occurs 0.00014% of the time. It is to be appreciated that such a stringent cutoff is likely to overlook many real disruptions.
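The arithmetic behind these figures can be checked with a few lines of Python. The variable names are chosen for this sketch only.

```python
queries = 3000          # queries in the fixed probe set
properties = 40         # properties logged per query
runs_per_day = 6        # executions of the probe set per day

# Each (query, property, run) combination yields one graph or table to evaluate.
series_per_day = queries * properties * runs_per_day

# For at most one false positive per day, a per-graph trigger may fire
# no more often than once in series_per_day evaluations.
max_trigger_rate_percent = 100 / series_per_day

print(series_per_day)                      # 720000
print(round(max_trigger_rate_percent, 5))  # 0.00014
```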

One way around the difficulty of avoiding false positives may be through aggregation of surprising values across multiple queries. Since there will always be a few queries with a high surprise, the anomaly detection engine 212 can construct a feature based on the number of surprise scores that deviate from historical norms. A sudden change in the number of high surprise scores, for example, might be a good indicator of a site disruption. This is done separately for each property being monitored by the anomaly detection engine 212. To make this quantitative, instead of counting the number of queries with a high surprise, some embodiments of the anomaly detection engine 212 can examine a quantile (e.g., the 0.9 quantile) of the surprise values for a property. Using the quantiles to detect anomalies is now described.

The surprise score of the most recent property value, as computed above with reference to FIG. 5, depends on at least the following: a property type (e.g., mean sales price listed), a property value (e.g., a value for the mean sales price listed), the probe (e.g., a query that generates a result of listed items for sale), and a collection window of recent property values for the probe. Once the surprise scores for each probe are calculated, the quantile of the surprise scores may be calculated. The quantile is computed by picking a property type and collection window, gathering the surprise scores for the property type, across all the probes, within the collection window, and then taking the 90% quantile of those surprise scores. Thus, there is a quantile for each property type-collection window pair. Every four hours, the following process is performed for each property being monitored: rerun the set of queries to obtain current property values for each query with respect to the property, recompute the surprise scores for the values so obtained, and then determine the 90% quantile of these surprise scores. This gives a new value for the 90% quantile for the property, which may be compared against the historical quantiles to determine whether an anomaly exists.
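A minimal Python sketch of the aggregation step follows, assuming the surprise scores for one collection window are available as (probe, property type, score) tuples. The function names and the nearest-rank quantile definition are assumptions of this sketch; other quantile definitions could be substituted.

```python
import math
from collections import defaultdict


def quantile(values, q=0.9):
    """Nearest-rank quantile: the smallest value with at least q of the data at or below it."""
    if not values:
        raise ValueError("cannot take a quantile of an empty collection")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def quantile_by_property(scores, q=0.9):
    """Aggregate surprise scores across all probes, separately per property type.

    scores: iterable of (probe, property_type, surprise_score) tuples
            gathered within a single collection window.
    """
    grouped = defaultdict(list)
    for _probe, property_type, score in scores:
        grouped[property_type].append(score)
    return {prop: quantile(vals, q) for prop, vals in grouped.items()}
```

Calling quantile_by_property once per four-hour cycle would yield one new feature value per property type, which can be appended to that property's quantile history.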

FIG. 8 is a histogram showing quantiles 800 for a property type (e.g., a median sale price) over a two-week period, according to an example embodiment. As shown in FIG. 8, the quantiles are clustered near 3.6, with a range from 3.0 to 4.6. In another view, FIG. 9 is a plot of the quantiles in a time series, according to an example embodiment. This shows the feature (e.g., the quantile of the surprise score) is fairly smooth, and might be, in some cases, a candidate for anomaly detection.

An example of an anomaly that corresponds to a genuine disruption is now described. FIG. 10 is a histogram that includes the quantiles 800 shown in FIG. 8 but with a new quantile 1002, according to an example embodiment. The new quantile 1002 may be calculated based on the value of the property when the anomaly detection engine 212 executes the set of queries again. In another view, FIG. 11 is a plot of the quantiles in a time series with the new quantile 1002, according to an example embodiment. For example, while FIG. 9 shows quantiles up to 07:00 on November 28, FIG. 10 adds the quantile 1002. The following observations are made. The distribution of the quantiles in FIG. 9 is vaguely normal with a mean of 3.7 and a standard deviation of 0.3. The value of the new quantile 1002 shown in FIGS. 10 and 11 is 29.6, which is |29.6 − 3.7|/0.3 ≈ 86 standard deviations from the mean quantile. So the historical value of the quantile appears to be a useful feature for detecting anomalies.

Summarizing the above, it is expected that an individual property for a particular query will have sudden jumps in values. Although these sudden jumps may represent outliers, an outlier, in and of itself, should not necessarily raise an alert. Instead, example embodiments may use the number of queries that have such jumps as a good signal for raising an alert of an anomaly. So a selection of features may go like this. For each (probe, property value) pair, we compute a measure of surprise and measure whether the latest property value of the property is an outlier. The anomaly detection engine 212 then has a surprise number for each query. It is expected to have a few large surprise numbers, but not too many. To quantify this, the anomaly detection engine 212 may in some embodiments select the 90th quantile of surprise values (e.g., by sorting the surprise values from low to high and returning the 90-th value, or by using a non-sorting function to calculate a quantile or ranking of surprise scores). This is our feature. Now we can use any outlier detection method to raise an alert. For example, in an example embodiment, the anomaly detection engine 212 may take the last 30 days' worth of signals and compute their mean and standard deviation. If the latest quantile of the signal is more than a threshold deviation from the mean (e.g., 5σ), the anomaly detection engine 212 raises an alert.
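As a minimal sketch of the example outlier test just described, and not a definitive implementation, the threshold comparison might look like the following. The function name `is_anomalous`, the use of the sample standard deviation, and the handling of a zero standard deviation are assumptions introduced for illustration.

```python
import statistics


def is_anomalous(historical_quantiles, latest_quantile, threshold_sigmas=5.0):
    """Return True if the latest quantile lies more than threshold_sigmas
    standard deviations from the mean of the historical quantiles.

    historical_quantiles is assumed to hold the feature values collected
    over the look-back period (e.g., the last 30 days).
    """
    if len(historical_quantiles) < 2:
        raise ValueError("at least two historical values are required")
    mean = statistics.mean(historical_quantiles)
    std = statistics.stdev(historical_quantiles)  # sample standard deviation
    if std == 0:
        # Assumption: with a perfectly flat history, any change is flagged.
        return latest_quantile != mean
    return abs(latest_quantile - mean) / std > threshold_sigmas
```

With a history clustered near 3.7, a latest quantile of 29.6 would exceed a 5σ threshold by a wide margin, whereas a latest quantile of 3.9 would not.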

A method for detecting an anomaly in a computer system is now described in greater detail. For example, FIG. 12 is a flowchart diagram illustrating a method 1200 for detecting an anomaly in a computer system, according to an example embodiment.

As FIG. 12 illustrates, the method 1200 may begin at operation 1202 when the probe module 210 executes probes on an evolving data set. Each probe returns a result derived from the evolving data set. By way of example and not limitation, the probe module 210 may issue a set of queries to the search engine 208 of FIG. 2. The probe module 210 may then receive search results for each of the queries issued to the search engine 208. It is to be appreciated that each of the probes (e.g., search queries) may be different and, accordingly, each result may also be different.

In some embodiments, as part of operation 1202, the probe module 210 is further configured to derive property values for each result returned from the probes. As discussed above, a property value may include data that quantifies a property or aspect of a result. To illustrate, again by way of example and not limitation, where the probe module 210 is configured to transmit a set of queries to the search engine 208, the property value may represent, for example, a value for the property of the number of items returned in the result, the average list price in the result, a measurement of the number of items that are auctions relative to non-auction items in the result, a number of classified listings in the result, or any other suitable property.

Thus, in some embodiments, the execution of operation 1202 may result in a data table that includes a number of property values that each correspond to one of the probes executed by the probe module 210. Further, as the probe module 210 may monitor more than one property type, the table may include multiple columns, where each column corresponds to a different property type. This is shown in FIG. 13. FIG. 13 is a diagram illustrating a property value table 1300 that may be generated based on executing the probes, according to an example embodiment. The property value table 1300 may store the property values (organized by property types 1304) collected for a single iteration of the probes 1302. As FIG. 13 shows, for a single probe, the probe module 210 may derive property values for multiple property types. Further, for a single property type, the probe module 210 may derive multiple property values, each corresponding to a different probe. Thus, a single property value may be specific to a probe-property type pair. For example, property value 1310 may be specific to the Probe₂-Property Type₂ pair.
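One possible in-memory representation of such a property value table is sketched below. The result layout (a list of item dictionaries with `price` and `is_auction` keys), the property type names, and the function names are all hypothetical and chosen only to mirror the example properties mentioned above.

```python
def mean_list_price(result):
    """Average list price of the items in one probe result (0.0 if empty)."""
    return sum(item["price"] for item in result) / len(result) if result else 0.0


def auction_ratio(result):
    """Fraction of the items in one probe result that are auctions."""
    auctions = sum(1 for item in result if item["is_auction"])
    return auctions / len(result) if result else 0.0


# Hypothetical property types: each maps a probe result to one number.
EXTRACTORS = {
    "item_count": len,
    "mean_list_price": mean_list_price,
    "auction_ratio": auction_ratio,
}


def build_property_value_table(probe_results, extractors=EXTRACTORS):
    """Build {probe: {property_type: value}} for one iteration of the probes."""
    return {
        probe: {ptype: float(fn(result)) for ptype, fn in extractors.items()}
        for probe, result in probe_results.items()
    }
```

In this sketch each outer key plays the role of a row (a probe) and each inner key plays the role of a column (a property type), so that a single value is addressed by a probe-property type pair.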

With reference back to FIG. 12, at operation 1204, the anomaly detection engine 212 may generate surprise scores for each of the property values. Each surprise score may be generated based on a comparison of a property value and historical property values that correspond to a probe-property type pair. For example, for a probe, the surprise score may be based on a function of the property value and a deviation from an expected value. An expected value may be determined based on a robust estimation of the tendency of the historical property values for that probe. A median, mode, mean, and trimmed mean are all examples of a robust estimation of the tendency of the value for the feature that can be used to generate a surprise score. In some cases, the surprise score for a value may be based on a standard deviation from the tendency for that value of the property. Thus, operation 1204 may generate a surprise score for the latest results where each surprise score corresponds to one of the queries in the set of queries. For example, FIG. 14 is a diagram showing property values 1400 for a probe-property type pair, according to an example embodiment. As shown in FIG. 14, the property values 1400 may include the property value 1310 received as part of a current iteration of executing the probes. As discussed with respect to FIG. 13, the property value 1310 may be specific to the Probe₂-Property₂ pair. The property values 1400 may also include historical property values 1402 that were obtained in past iterations of executing the probes. The historical property values 1402 are specific to the same probe-property type pair as the property value 1310 (e.g., Probe₂-Property₂ pair). The surprise score is based on the deviation of the sample property value 1310 from the historical property values 1402.
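For illustration, one way a surprise score of this general kind could be computed is sketched below. This sketch assumes the median as the robust estimate of tendency and the median absolute deviation as the measure of spread; both choices, along with the function name `surprise_score`, are assumptions made here and are not necessarily the computation described with reference to FIG. 5.

```python
import statistics


def surprise_score(current_value, historical_values):
    """Score how far current_value lies from the historical tendency
    for one probe-property type pair.

    Assumption: the tendency is the median and the spread is the median
    absolute deviation (MAD) of the historical values.
    """
    if not historical_values:
        raise ValueError("historical values are required")
    center = statistics.median(historical_values)
    mad = statistics.median([abs(v - center) for v in historical_values])
    if mad == 0:
        # Flat history: no surprise if unchanged, unbounded otherwise.
        return 0.0 if current_value == center else float("inf")
    return abs(current_value - center) / mad
```

A value equal to the historical median yields a score of zero, and the score grows as the current value moves away from the median relative to the typical historical spread.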

As discussed above with respect to the operation 1204 shown in FIG. 12, a surprise score is generated for each probe-property type pair. This is shown in FIG. 15, which is a diagram illustrating a surprise score table 1500, according to an example embodiment. The surprise score table 1500 includes a surprise score for each of the probe-property type pairs. The surprise score table 1500 is generated based on calculating a surprise score in the manner discussed with respect to FIG. 14. That is, a surprise score is generated for each probe-property type pair based on a comparison between the property value corresponding to the probe-property type pair and the historical property values for the probe-property type pair.

With reference back to FIG. 12, at operation 1206, the anomaly detection engine 212 may access a plurality of historical surprise scores generated by the anomaly detection engine 212. In some cases, the historical surprise scores accessed at operation 1206 may be based on past iterations of executing the probes. Further, in some cases, the historical surprise scores may be specific to a particular property type.

At operation 1208, responsive to a comparison between the plurality of surprise scores and the plurality of historical surprise scores, the anomaly detection engine 212 may alert a monitoring system of an anomaly regarding the evolving data set. With momentary reference to FIG. 15, the idea of operation 1208 is that an alert is generated if the surprise scores for a given property type (e.g., Property₂), across all probes (e.g., Probes₁₋₆), are out of the norm relative to historical surprise scores for those probe-property type pairs.

The comparison used by operation 1208 may be based on a feature derived from the surprise scores. To illustrate, FIG. 16 is a chart showing a measurement of a feature of the surprise scores generated by operation 1204, according to an example embodiment. The measurement of the feature shown in FIG. 16 is for the surprise scores generated for a single iteration of the execution of the probes. For example, the chart 1600 may measure the feature for the surprise scores generated across Probes₁₋₆ for Property Type₂. For example, the feature may be a measurement of how far a surprise score deviates from an expected value. An expected value for the surprise score may be calculated based on a robust estimation of the tendency of the historical surprise scores for that property type. In some cases, the feature may be a quantile of the surprise scores, or a quantile of the data derived from the surprise scores (e.g., the deviation from the expected value). Still further, in some cases, the feature may be a measurement or count of the number of surprise scores that deviate beyond a threshold amount from the expected value, or that exceed a fixed surprise score.
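The count-based features mentioned above might be sketched as follows. The function names are hypothetical, and the use of the median of the historical surprise scores as the expected value is an assumption made for this sketch.

```python
import statistics


def count_above_fixed_score(surprise_scores, fixed_score):
    """Count the surprise scores that exceed a fixed surprise score."""
    return sum(1 for score in surprise_scores if score > fixed_score)


def count_deviating_from_expected(surprise_scores, historical_scores, threshold):
    """Count the surprise scores that deviate from the expected value by
    more than threshold.

    Assumption: the expected value is the median of the historical
    surprise scores for the same property type.
    """
    expected = statistics.median(historical_scores)
    return sum(1 for score in surprise_scores if abs(score - expected) > threshold)
```

Either count yields a single number per property type per iteration of the probes, which can then be tracked over time in the same way as a quantile.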

As part of operation 1208, the feature of the surprise scores may then be compared against features of historical surprise scores from past iterations of executing the probes. This is shown in FIG. 17. FIG. 17 is a chart illustrating surprise score features 1700 over time, according to an example embodiment. For example, the surprise score feature 1702 may be a feature of the surprise scores for a current iteration of the execution of the probes, while the historical surprise score features 1704 are features of surprise scores from past iterations of executing the probes. Here, operation 1208 may alert if the surprise score feature 1702 deviates from the historical surprise score features 1704 beyond a threshold amount.

It is to be appreciated that the operations 1206 and 1208 shown in FIG. 12 may be repeated across all the property types monitored by the probe module 210. For example, with reference to FIG. 5, the operations 1206 and 1208 may execute across Property Types₁₋₅.
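As a rough sketch of how operations 1206 and 1208 might be repeated per property type, consider the following. The table layout (`{probe: {property_type: surprise_score}}`), the per-property history of past features, the nearest-rank 0.9 quantile as the feature, and the 5σ rule are all assumptions chosen for illustration rather than requirements of the method 1200.

```python
import math
import statistics


def quantile(values, q=0.9):
    """Nearest-rank q-quantile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(round(q * len(ordered), 9)))
    return ordered[rank - 1]


def detect_anomalies(surprise_table, feature_history, threshold_sigmas=5.0):
    """Return the property types whose current feature is out of the norm.

    surprise_table:  {probe: {property_type: surprise_score}} for the
                     current iteration (hypothetical layout).
    feature_history: {property_type: [past feature values]}.
    """
    alerts = []
    property_types = sorted({p for row in surprise_table.values() for p in row})
    for ptype in property_types:
        # Gather this property type's surprise scores across all probes.
        scores = [row[ptype] for row in surprise_table.values() if ptype in row]
        feature = quantile(scores)
        past = feature_history.get(ptype, [])
        if len(past) < 2:
            continue  # not enough history to estimate a norm
        mean = statistics.mean(past)
        std = statistics.stdev(past)
        if std > 0 and abs(feature - mean) / std > threshold_sigmas:
            alerts.append(ptype)
    return alerts
```

In this sketch a monitoring system would be alerted once for each property type returned, and the current feature would then be appended to that property type's history for use in later iterations.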

It is to be further appreciated that although much of this disclosure discusses anomaly detection in the context of a search engine, other example embodiments may use the anomaly detection methods and systems described herein to detect anomalies in other types of computer systems. For example, the computer system may be an inventory data store. In such a case, the probe module 210 may be configured to detect as property types, among other things, the number of items stored per category, the number of auction items per category, and the like.

As another example, the computer system may be a computer infrastructure (e.g., a collection of computer servers). In such a case, the probe module 210 may be configured to detect as property types, among other things, a processor load, bandwidth consumption, thread count, running processes count, memory usage, throughput count, rate of disk seeks, rate of packets transmitted or received, or rate of response.

In other embodiments, the property values tracked by the anomaly detection engine 212 may include dimensions in addition to what is described above. For example, in the embodiment discussed above, one may conceptualize that a table may be used to store the property values, where the columns are the metrics tracked by the different probe modules, and the rows are the different values for those property types at different times. An extension would be a 3D-table or cube. For each (property, value) cell, there may be a series of aspects instead of a single number. An aspect may be a vertical stack out of the page. In example (1), the aspect might be different countries. Thus, a cell in a table may be related to a specific query (perhaps ‘iPhone 5S’) and property (perhaps number of results). But using the aspects, the results vary by country, so the single cell is replaced by a stack of entries, one for each country.
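One way to picture such a cube in code is sketched below, with each table cell holding a stack of per-aspect values rather than a single number. The function name `build_cube`, the tuple layout of the observations, and the country codes and numbers in the usage lines are made up purely for illustration.

```python
from collections import defaultdict


def build_cube(observations):
    """Build {(probe, property_type): {aspect: value}} from an iterable of
    (probe, property_type, aspect, value) tuples (hypothetical layout)."""
    cube = defaultdict(dict)
    for probe, property_type, aspect, value in observations:
        cube[(probe, property_type)][aspect] = value
    return dict(cube)


# Illustrative, made-up observations: one query, one property, two countries.
observations = [
    ("iPhone 5S", "result_count", "US", 120.0),
    ("iPhone 5S", "result_count", "DE", 45.0),
]
cube = build_cube(observations)
stack = cube[("iPhone 5S", "result_count")]  # one entry per country
```

Under this layout, surprise scores could be computed per aspect entry in the same manner as for a single cell, although the disclosure does not prescribe how the stack is to be scored.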

As mentioned in the previous section, the anomaly detection engine 212 may be configured to detect a problem with the search software (a disruption) before users do. In some cases, the property values may be received from the same interfaces and using the same computer systems used by the end users. In such cases, the metric data received from the probe module 210 is a proxy for, if not identical to, the user experience of users of that computer system. Accordingly, it is to be appreciated that when this disclosure states the anomaly detection engine 212 may detect a problem before users do, it may simply mean that the anomaly detection engine 212 can detect and report a problem without intervention from a user. Thus, compared to traditional systems, example embodiments may use the anomaly detection engine 212 to provide comparatively quick detection of site problems.

Example Computer System

FIG. 18 shows a diagrammatic representation of a machine in the example form of a computer system 1800 within which a set of instructions, for causing the machine to perform any one or more of the methodologies discussed herein, may be executed. The computer system 1800 comprises, for example, any of the device machine 110, device machine 112, applications servers 118, API server 114, web server 116, database servers 124, or third party server 130. In alternative embodiments, the machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server or a device machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may be a server computer, a client computer, a personal computer (PC), a tablet, a set-top box (STB), a Personal Digital Assistant (PDA), a smart phone, a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

The example computer system 1800 includes a processor 1802 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 1804 and a static memory 1806, which communicate with each other via a bus 1808. The computer system 1800 may further include a video display unit 1810 (e.g., liquid crystal display (LCD), organic light emitting diode (OLED), touch screen, or a cathode ray tube (CRT)). The computer system 1800 also includes an alphanumeric input device 1812 (e.g., a physical or virtual keyboard), a cursor control device 1814 (e.g., a mouse, a touch screen, a touchpad, a trackball, a trackpad), a disk drive unit 1816, a signal generation device 1818 (e.g., a speaker) and a network interface device 1820.

The disk drive unit 1816 includes a machine-readable medium 1822 on which is stored one or more sets of instructions 1824 (e.g., software) embodying any one or more of the methodologies or functions described herein. The instructions 1824 may also reside, completely or at least partially, within the main memory 1804 and/or within the processor 1802 during execution thereof by the computer system 1800, the main memory 1804 and the processor 1802 also constituting machine-readable media.

The instructions 1824 may further be transmitted or received over a network 1826 via the network interface device 1820.

While the machine-readable medium 1822 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “machine-readable medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention. The term “machine-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.

It will be appreciated that, for clarity purposes, the above description describes some embodiments with reference to different functional units or processors. However, it will be apparent that any suitable distribution of functionality between different functional units, processors or domains may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processors or controllers may be performed by the same processor or controller. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality, rather than indicative of a strict logical or physical structure or organization.

Certain embodiments described herein may be implemented as logic or a number of modules, engines, components, or mechanisms. A module, engine, logic, component, or mechanism (collectively referred to as a “module”) may be a tangible unit capable of performing certain operations and configured or arranged in a certain manner. In certain example embodiments, one or more computer systems (e.g., a standalone, client, or server computer system) or one or more components of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) or firmware (note that software and firmware can generally be used interchangeably herein as is known by a skilled artisan) as a module that operates to perform certain operations described herein.

In various embodiments, a module may be implemented mechanically or electronically. For example, a module may comprise dedicated circuitry or logic that is permanently configured (e.g., within a special-purpose processor, application specific integrated circuit (ASIC), or array) to perform certain operations. A module may also comprise programmable logic or circuitry (e.g., as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured by software or firmware to perform certain operations. It will be appreciated that a decision to implement a module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by, for example, cost, time, energy-usage, and package size considerations.

Accordingly, the term “module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), non-transitory, or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein. Considering embodiments in which modules or components are temporarily configured (e.g., programmed), each of the modules or components need not be configured or instantiated at any one instance in time. For example, where the modules or components comprise a general-purpose processor configured using software, the general-purpose processor may be configured as respective different modules at different times. Software may accordingly configure the processor to constitute a particular module at one instance of time and to constitute a different module at a different instance of time.

Modules can provide information to, and receive information from, other modules. Accordingly, the described modules may be regarded as being communicatively coupled. Where multiples of such modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) that connect the modules. In embodiments in which multiple modules are configured or instantiated at different times, communications between such modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple modules have access. For example, one module may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further module may then, at a later time, access the memory device to retrieve and process the stored output. Modules may also initiate communications with input or output devices and can operate on a resource (e.g., a collection of information).

Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. One skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. Moreover, it will be appreciated that various modifications and alterations may be made by those skilled in the art without departing from the scope of the invention.

The Abstract is provided to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment. 

What is claimed is:
 1. A computer-implemented system comprising: a probe module implemented by one or more processors and configured to execute a plurality of probes on an evolving data set, each probe from the plurality of probes returning a result, the probe module further configured to derive a plurality of property values, each property value from the plurality of property values is derived from a respective result returned by a corresponding probe of the plurality of probes; and an anomaly detection engine implemented by the one or more processors and configured to: generate a plurality of surprise scores corresponding to the plurality of property values, each surprise score being generated based on a comparison of a corresponding property value from the plurality of property values and historical property values, the corresponding property value and the historical property values having been derived from results returned from the same probe, access a plurality of historical surprise scores generated by the anomaly detection engine, and responsive to a comparison between the plurality of surprise scores and the plurality of historical surprise scores, alert a monitoring system of an anomaly regarding the evolving data set. 